Abstract

Memory-based reasoning (MBR) is a technique that makes intensive use of memory
to recall specific episodes from the past for problem solving. It is used in this
research to predict protein structures based on 112 known structures selected from the
Brookhaven Protein Databank. The φ and ψ angles of each amino acid in a protein are
used to represent its 3-D structure. For this particular problem, we extend MBR to
include a recursive procedure that refines its initial prediction and a varying “window” size
that takes into account the interactions between amino acids at different distances
along the amino acid sequence. The implemented system, PHI-PSI, has been tested
with all the available data. It does better than distribution-based guesses for most of
the φ and ψ angle values.


1  Motivation  and  Introduction 


When  faced  with  a  problem,  what  should  we  do  if  we  do  not  have  enough  domain  knowledge 
(“rules”)  to  solve  it  but  do  have  a  set  of  examples?  Problems  of  this  kind  abound  in 
our  daily  life  as  well  as  in  scientific  investigations.  In  this  research  we  take  the  protein 
structure  prediction  problem  as  our  vehicle  of  investigation,  and  focus  on  the  development 
of  a  computational  model  for  solving  such  problems. 
It is known that all proteins in all species, from bacteria to humans, are composed of the
same  set  of  20  amino  acids  and  that  every  protein  has  a  unique  amino  acid  sequence  which 
specifies  its  three-dimensional  structure.  It  is  now  fairly  easy  to  determine  a  protein’s  amino 
acid  sequence  or  even  to  chemically  synthesize  a  new  sequence,  but  extremely  difficult  to 
determine its 3-D structure.¹ Automatic determination of protein structure is of great
scientific  and  practical  value  because  it  is  closely  related  to  protein  function.  Though  we 
know  the  amino  acid  sequences  for  thousands  of  proteins,  we  only  know  the  structures  for 
a  few  hundred  of  them.  And  the  principles  underlying  the  correspondence  between  the 
structure  and  the  amino  acid  sequence  are  poorly  understood.  An  interesting  question 
is:  How  do  we  use  the  information  hidden  in  the  known  structures  to  help  predict  the 
unknown? 

Memory-based  reasoning  (MBR)  [14]  assumes  that  the  intensive  use  of  memory  to 
recall  specific  episodes  from  the  past  should  be  the  foundation  of  machine  reasoning.  It  can 
make use of massively parallel hardware such as the Connection Machine [5] to produce
an efficient implementation. Given a problem, an MBR system “recalls” all the precedents
in its memory that bear some resemblance to the problem and derives a solution based on
those  retrieved  precedents  through  some  decision-making  process.  Based  on  this  idea,  we 
developed  the  system  PHI-PSI  which  makes  use  of  known  protein  structures  to  predict  the 
structure  of  proteins  of  which  we  only  know  the  amino  acid  sequences. 

In  this  paper  we  discuss  the  design  principles  and  initial  results  of  PHI-PSI.  We  also 
include  brief  background  information  on  the  biology  needed  to  understand  this  application 
(some  of  which  is  put  in  footnotes).  We  conclude  with  a  discussion  about  the  related  work 
and  plans  for  future  research. 

2  Amino Acids and Their φ and ψ Angles

Proteins are composed of linear sequences of amino acids. There are 20 different types of
amino acids; they are the “alphabet” of all proteins. The 20 amino acids differ from each
other only in their side chains (a group of atoms), which determine their physical properties
(see figure 1 (left)). According to these properties, amino acids can be classified into
similar/dissimilar groups, such as small/large, polar/non-polar, hydrophobic/hydrophilic,
etc. In a protein, the carboxyl group of one amino acid is joined to the amino group of
another amino acid by a peptide bond. Many amino acids, usually a hundred or more, are
joined by peptide bonds to form a polypeptide chain (also called a primary sequence; each
amino acid in it is sometimes called a residue), as shown in figure 1 (right). Figure 2
(left) shows the spatial relations for atoms joined by two peptide bonds. The atoms
between two Cα's are basically on a plane and the bond lengths are essentially fixed.
There are two degrees of freedom about the relative positions of two adjacent planes, the
φ and ψ angles, which are the primitive structural descriptors used in this work.²

¹ Right now mainly through crystallography, which is at best time consuming (typically ten
man-years per structure), and at worst impossible because crystals are either unsuitable or
unavailable.

3  Approach 

PHI-PSI  was  developed  based  on  the  hypothesis  that  if  two  amino  acids  have  similar 
physical  properties  and  they  occur  in  a  similar  physical  environment,  then  they  should 
form  similar  structures  (e.g.  similar  angles).  Here  are  some  terms  used  in  the  following 
discussion:  (1)  test  protein  -  the  protein  whose  structure  is  going  to  be  predicted,  and 
whose amino acid sequence is the input to PHI-PSI; (2) database - all the protein data
we have except the test protein; here both the amino acid sequences and the φ and ψ
angles of each amino acid are known; (3) window - one-dimensional frame of slots that is
to  be  overlaid  on  and  moved  over  a  test  protein  sequence  to  access  segments  of  its  amino 
acids. 

3.1  The  Basic  Algorithm 


For each test protein, PHI-PSI works as follows:

Step 1. Specify the initial parameters, such as the initial window size W, the window weight
pattern P (there is a weight associated with each slot in the window), and N, the number
of best matches to keep (from which a prediction is made), etc. (These parameters will be
discussed in more detail below.)

Step 2. Move the window over the test protein, and at each position, extract an amino acid
segment S of length W, and do:

1. move the same window over all the protein sequences in the database and generate all
the possible amino acid segments s_i of length W, i = 1, 2, ..., m;

2. match S against all s_i, i = 1, 2, ..., m, and compute a score for each using a scoring
function which will be described in the next section;

3. select the N segments from {s_1, ..., s_m} which have the N highest scores. The prediction
of the φ and ψ angles of S's centermost amino acid is made by majority of the φ and
ψ values of the amino acids in the N selected segments.

Step 3. If the recursive mode is chosen, adjust the parameters (e.g. the window size) and repeat
Step 2 unless the end conditions are met or PHI-PSI has gone through a pre-specified
number of recursive levels.
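
Step 2 can be sketched in Python as follows. This is only an illustrative, sequential sketch under stated assumptions, not the Connection Machine implementation: the names `windows` and `predict_angles` are hypothetical, the scoring function is passed in as a parameter, and the "majority" decision is approximated here by voting over coarse angle bins (the bin width `bin_deg` is an assumption that does not come from the paper).

```python
from collections import Counter

def windows(seq, w):
    """Yield (center_index, segment) for every length-w window over seq."""
    for start in range(len(seq) - w + 1):
        yield start + w // 2, seq[start:start + w]

def predict_angles(test_seq, database, w, n_best, score, bin_deg=30.0):
    """database: list of (sequence, angles), with angles[i] = (phi, psi).

    Returns {center_index: (phi, psi)} for every window position in test_seq.
    """
    def bin_of(angles):
        # Coarse (phi, psi) bin used only for the majority vote (assumption).
        return (round(angles[0] / bin_deg), round(angles[1] / bin_deg))

    predictions = {}
    for center, seg in windows(test_seq, w):
        scored = []
        for db_seq, db_angles in database:
            for db_center, db_seg in windows(db_seq, w):
                scored.append((score(seg, db_seg), db_angles[db_center]))
        if not scored:
            continue
        # Keep the angles of the centermost residue of the N best matches.
        scored.sort(key=lambda item: item[0], reverse=True)
        top = [angles for _, angles in scored[:n_best]]
        winner = Counter(bin_of(a) for a in top).most_common(1)[0][0]
        predictions[center] = next(a for a in top if bin_of(a) == winner)
    return predictions
```

A recursive run (Step 3) would call `predict_angles` again with a larger `w` and a scoring function that also takes the previous predictions into account.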

² Most work in protein structure prediction has focused on secondary structures, which refer to the
regular, repetitive spatial arrangements of residues that are close to one another in the polypeptide chain.
They include: (i) α helix, where the polypeptide chain has a helical shape to produce a rodlike structure;
(ii) β sheet, where the polypeptide chain is almost fully extended; (iii) turn, where within a few (usually
4) residues the polypeptide chain turns ~180°. Coil is often used to mean “none of the above.” We use the
φ-ψ angles instead in our research for several reasons: (1) there are no agreed assignments of the secondary
structures in certain cases; (2) φ-ψ angles compose a richer vocabulary; for example, some biologists have
found 8 types of β turns, which cannot be distinguished if one just calls them all turns; (3) φ-ψ angles can
be used to describe any part of a protein, not just the secondary structures.


3.2  The  Similarity  Measure 

Given  two  patterns,  a  scoring  function  computes  a  score  to  represent  how  similar  they  are 
to  each  other.  To  define  a  scoring  function,  several  factors  need  to  be  considered:  (a)  the 
similarity  matrices,  which  specify  how  similar  the  primitive  components  (here,  the  amino 
acids)  are  to  each  other  in  terms  of  some  properties  (usually  one  matrix  for  each  property); 
(b)  the  weight  of  each  similarity  matrix,  which  represents  how  important  the  corresponding 
property  is  to  the  overall  similarity;  (c)  the  weights  associated  with  each  position  of  the 
pattern,  which  indicates  how  important  that  position  is  to  the  whole  pattern;  and  (d)  if 
the  matching  is  done  recursively,  how  strongly  the  previous  result  should  affect  the  next 
match. The following is the function used by PHI-PSI, which takes two segments of amino
acids X = X_1 X_2 ... X_n and Y = Y_1 Y_2 ... Y_n as arguments:

$$\mathrm{Score} = \sum_{i=1}^{n} Wp_i \cdot \Big\{ \sum_{j=1}^{m} Wm_j \cdot S_j[X_i, Y_i] \;-\; Wa_i \cdot \max\big( |X_i[\phi] - Y_i[\phi]|,\ |X_i[\psi] - Y_i[\psi]| \big) \Big\}$$

where X_i and Y_i are the amino acids in X and Y respectively, i ∈ [1, n]; X_i[φ] means
the φ angle of X_i, whose value is within [−180, 180);³ Wp_i is the window slot weight for
position i; S_j is the jth similarity matrix, and S_j[X_i, Y_i] means the entry value for amino acid
pair (X_i, Y_i); Wm_j is the weight for S_j; Wa_i is the weight for previously-predicted φ and
ψ angle values, which is zero when the angles are unknown.

What this function does is the following: for every pair (X_i, Y_i) from X and Y, it computes
the weighted sum of the corresponding entries for X_i and Y_i in all the similarity matrices,
and if in recursive mode, subtracts the difference between the previously-predicted
and currently-retrieved φ-ψ values. This sum is taken as the subscore for each pair (X_i, Y_i).
The function then computes the weighted sum of all the subscores as the score of matching
the two segments X and Y. So, the matrices for important properties should have more
weight,  and  the  important  positions  in  the  window  should  have  more  weight  in  order  for 
the  scores  to  reflect  the  structural  similarity  between  the  two  amino  acid  segments. 
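
The scoring function described above can be sketched as follows. This is a minimal sketch, not PHI-PSI's actual code: the name `segment_score`, the dictionary representation of the similarity matrices, and the argument layout are all assumptions made for illustration. Angle differences are taken with wrap-around at ±180°, which is our reading of the note that −180° and +180° are the same angle.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def segment_score(x, y, slot_w, matrices, matrix_w,
                  prev_angles=None, y_angles=None, angle_w=None):
    """Weighted similarity of two equal-length amino acid segments x and y.

    slot_w[i]   : window slot weight for position i      (Wp_i)
    matrices[j] : dict mapping (aa, aa) -> similarity    (S_j)
    matrix_w[j] : weight of the j-th similarity matrix   (Wm_j)
    prev_angles : previously predicted (phi, psi) per position of x, or None
    y_angles    : known (phi, psi) per position of y
    angle_w[i]  : weight of the angle penalty            (Wa_i)
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        # Weighted sum over all property matrices for this pair.
        sub = sum(wm * s[(xi, yi)] for wm, s in zip(matrix_w, matrices))
        # Recursive mode: penalize disagreement with the previous prediction.
        if prev_angles is not None and prev_angles[i] is not None:
            dphi = angle_diff(prev_angles[i][0], y_angles[i][0])
            dpsi = angle_diff(prev_angles[i][1], y_angles[i][1])
            sub -= angle_w[i] * max(dphi, dpsi)
        total += slot_w[i] * sub
    return total
```

Leaving `prev_angles` as `None` corresponds to the first pass, where the angle weight is effectively zero because no prediction exists yet.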

We have used 10 amino acid properties, such as size, hydrophobicity, polarity, etc. [16]
A 20 × 20 similarity matrix is computed for each property based on values obtained from
the biology literature. Each matrix entry has a value in [0, 1], representing the degree of
similarity between a pair of amino acids in terms of that property. The diagonal elements
of each matrix, that is, the entry for pair (A_i, A_i) for some amino acid A_i, should have
value 1.0. However, since we do not fully understand the similarity between amino acids
in terms of forming protein structures, exact matches are always preferred. To emphasize
this, the diagonal elements are increased to 1.5.⁴

³ Note that −180° is equal to +180° here.

⁴ We tested various values in [1, 2]; 1.5 seems to give the best prediction.
3.3  The Window Size


There is a trade-off between the amount of information and the level of noise when we
choose a window size. The larger the window, the more information it contains, but when
the window is too large, the matching process may be misled by the “noise,” i.e., the
irrelevant sequentially distant information.

The  technique  used  in  this  work  to  make  the  trade-off  is  to  start  with  a  small  window, 
and to increase the window size gradually. At each level of recursion, the previous prediction
from a smaller window is used in finding the best matches for the next prediction of a
larger window. This way, we can catch the information of the very short range interactions
as  well  as  take  into  account  the  somewhat  longer  range  interactions  between  the  amino 
acids  in  a  primary  sequence. 

To a first approximation, we can assume that the structure of a protein is solely determined
by the interactions among its amino acids, and that amino acids interact with
each other only if they are close to each other in space. For every position i in a primary
sequence, we computed the probability (approximated by the frequency) with which
the residues at positions i ± 1, i ± 2, ... are within 7 Å of residue i.⁵ Residues two
positions or less apart on the amino acid sequence are almost always close to each other,
in accordance with the standard bond lengths and configuration; residues within four
positions are close to each other much more often than are the rest. Thus either 5 or 7
can be used as the initial window size.
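
The frequency computation described above could look like the following sketch. It assumes each residue is represented by a single 3-D point (for example its Cα coordinates); the paper does not say which atoms were used to measure the distance, so this choice, like the function name `contact_frequency`, is an assumption.

```python
import math

def contact_frequency(proteins, max_sep=8, cutoff=7.0):
    """Fraction of residue pairs (i, i + k) lying within `cutoff` angstroms.

    proteins: list of proteins, each a list of (x, y, z) points, one per residue.
    Returns {k: frequency} for sequence separations k = 1 .. max_sep.
    """
    close = [0] * (max_sep + 1)
    total = [0] * (max_sep + 1)
    for coords in proteins:
        for i in range(len(coords)):
            for k in range(1, max_sep + 1):
                if i + k >= len(coords):
                    break
                total[k] += 1
                if math.dist(coords[i], coords[i + k]) <= cutoff:
                    close[k] += 1
    return {k: close[k] / total[k] for k in range(1, max_sep + 1) if total[k]}
```

Counting each pair (i, i + k) once covers both the +k and −k neighbours, since spatial distance is symmetric. The resulting frequencies are also what a window weight pattern could be made proportional to.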

3.4  The  Weight  Pattern 

The  weight  pattern  for  the  window  represents  the  influence  that  the  amino  acids  at  different 
positions have on the structure of the amino acid in the center of the window. So if position
i is at the center of the window, the weight of position i ± j should be in proportion to the
probability with which residue i ± j stays close to residue i (based on the assumption in the
last section). We have tested different window weight patterns; a typical one looks like:
... 3 3 4 4 5 4 4 3 3 ...


3.5  Find the “Allies” for the Best Matches

After  we  find  the  top  N  matches  (the  ones  with  the  highest  N  scores)  from  the  database 
for an input amino acid segment, we need a way to make a decision about which φ and ψ
values  to  use  as  predictions  for  the  input. 

First, why do we bother to use the top N matches, not just the one with the highest
score? Two reasons: (1) Usually the top matches have very similar scores. Recall that the
scoring function is based on quite a few factors. We are not quite sure about the exact
weights for these factors. The actual values used are at best approximations of the optimal
values. So a small difference in the score may not mean much. (2) Even if two (short)
segments of amino acids match exactly, they do not necessarily form the same structure in
two proteins. However, if among the top N matches, the majority of them have a similar
structure, then the input will at least have the tendency to form that structure also.

⁵ 7 Å is roughly the distance within which two residues can directly interact with each other.

PHI-PSI  makes  this  decision  in  the  following  way:  for  each  segment  in  the  top  N 
matches it finds out how many other segments in this group have similar φ and ψ values for the
centermost residue; these are called its “allies” (the threshold for “similar values” here is
another  parameter  that  can  be  adjusted).  The  one  with  the  largest  number  of  allies  is 
chosen  as  the  prediction. 
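
The ally-counting rule can be sketched as below. The function name `pick_by_allies`, the default threshold of 30°, the requirement that both φ and ψ fall within the threshold, and the tie-breaking rule (the higher-scoring match wins) are assumptions for illustration; the paper only states that the threshold is an adjustable parameter.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pick_by_allies(candidates, threshold=30.0):
    """Choose the (phi, psi) pair with the most allies among the top-N matches.

    candidates: (phi, psi) of the centermost residue of each match,
                ordered from best score to worst.
    """
    def similar(p, q):
        return (angle_diff(p[0], q[0]) <= threshold
                and angle_diff(p[1], q[1]) <= threshold)

    best, best_allies = None, -1
    for i, p in enumerate(candidates):
        allies = sum(1 for j, q in enumerate(candidates)
                     if j != i and similar(p, q))
        # Strict '>' keeps the earlier (higher-scoring) candidate on ties.
        if allies > best_allies:
            best, best_allies = p, allies
    return best
```

With this rule an isolated top-scoring match loses to a cluster of slightly lower-scoring matches that agree with one another, which is the behaviour the two reasons above argue for.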


4  Discussions 

4.1  Selection  of  Data 

Homologous proteins⁶ have similar amino acid sequences and structures. So if two proteins
in the database are homologous, then when we try to predict the structure of one of them,
we almost always find the best match from the other. This way, though we have a
high prediction accuracy, the result is deceiving.⁷ We used a sequence comparison package
developed  on  the  Connection  Machine  [8]  to  find  the  homologous  protein  clusters  and 
removed  all  but  one  sequence  (usually  the  longest  one  and/or  the  one  with  the  highest 
resolution)  from  each  cluster.  The  112  amino  acid  sequences  left  have  been  used  in  this 
work. 


4.2  Initial  Results  and  Analysis 

We made one complete run of PHI-PSI on the whole database, that is, for every amino
acid sequence p in the database:

select  p  as  test  protein  and  use  the  rest  as  known  proteins; 
run  PHI-PSI  to  predict  p’s  structure; 

compare  the  prediction  with  p’s  real  structure  and  compute  the  prediction  errors. 
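
The procedure above is a leave-one-out evaluation, which can be sketched as follows. The names `leave_one_out`, `predict`, and `error` are placeholders for illustration; any prediction routine and error measure with these call shapes could be plugged in.

```python
def leave_one_out(proteins, predict, error):
    """Evaluate a predictor by holding out each protein in turn.

    proteins: list of (sequence, true_angles) pairs
    predict : function(sequence, database) -> predicted angles
    error   : function(predicted, true_angles) -> error value
    """
    results = []
    for k, (seq, true_angles) in enumerate(proteins):
        # The test protein is excluded from the database it is matched against.
        database = proteins[:k] + proteins[k + 1:]
        predicted = predict(seq, database)
        results.append(error(predicted, true_angles))
    return results
```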

The parameters used: initial_window_size = 5; recursive_level = 5.

This involves a large amount of computation. For example, there are 112 primary
sequences in our database, which contain 18713 amino acids altogether; the recursive level
is set to 5 and at each level the window size is increased by 2, so on average there are 10
amino acids in each segment; each amino acid sequence generates on average
(18713 ÷ 112) − 10 ≈ 157 segments; there are 10 property matrices; for each amino acid
segment in a test protein, PHI-PSI matches it against the whole database (except itself);
thus the total number of table lookups (i.e. calls to the similarity matrix entries) for a
complete run is:

$$(112 \times 157) \times [(112 - 1) \times 157 \times 10 \times 10 \times 5] = 153218184000 \approx 1.5 \times 10^{11}.$$

⁶ Proteins that have common ancestors.

⁷ Given a protein with unknown structure, it would be very helpful if we knew the structure of a protein
that is homologous to it. But in most cases, we do not know any of its homologous proteins.

Roughly the same number of multiplications is needed (see the scoring function in
section 3.2). The complete run took about 100 hours on a 4K Connection Machine without
floating point processors (FPPs). With FPPs and the new indirect-addressing facility for
virtual processors, this computation should be 4 times as fast.⁸ Also, the algorithm speeds
up linearly with the number of processors.
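
The arithmetic behind the lookup count can be reproduced with a few lines of Python. The variable names are ours; the numbers are those quoted in the text.

```python
n_proteins = 112      # primary sequences in the database
n_residues = 18713    # amino acids altogether
avg_window = 10       # average segment length quoted in the text
n_matrices = 10       # property matrices
levels = 5            # recursive levels
run_hours = 100       # reported duration of the complete run

# Average number of segments generated per sequence.
segs = round(n_residues / n_proteins - avg_window)

# (test segments) x (database segments x window x matrices x levels)
lookups = (n_proteins * segs) * (
    (n_proteins - 1) * segs * avg_window * n_matrices * levels
)

# Implied average throughput of the reported run.
lookups_per_second = lookups / (run_hours * 3600)

print(segs, lookups, f"{lookups:.1e}", round(lookups_per_second))
```

At 100 hours for the whole run, this corresponds to a few hundred thousand table lookups per second on average.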

The prediction errors are computed in terms of φ and ψ angles. There are several ways
to measure the errors, such as: (1) residue errors - the difference between the real angle
values computed from the 3-D coordinates and the values predicted by the algorithm for
a particular residue in a protein; (2) overall errors - the average of the residue errors of
all the proteins in the database.

4.2.1  The  Overall  Errors 

The overall errors of the complete run are: φ_error = 37.7°, ψ_error = 63.7°. For comparison,
the average differences of φ and ψ angle values among 1000 randomly selected
residues in the database are: random_φ_difference ≈ 56°, random_ψ_difference ≈ 89°.
The prediction errors are considerably smaller than the random differences,⁹ which demonstrates
that our algorithm can indeed find some correspondences between segments of amino
acids and their structure.¹⁰

4.2.2  The  Residue  Errors 

Figure 3 shows the average residue errors (the dark curves) and the “random differences”
(the thin lines) for all the angle values. The random difference for angle value V (V ∈
[−180, 180)) is computed by randomly picking 1000 residues from the database and
calculating the average difference between V and these angle values. A comparison of the
prediction error curve with the random difference curve can give us an idea about how
well PHI-PSI does with each particular angle value. Put another way, for each residue in
a  test  protein,  the  “error”  of  a  random  guess  (based  only  on  the  distribution  of  the 
angle  values  in  the  database)  has  the  greatest  probability  to  fall  on  the  random  difference 
curve.  So  the  vertical  distance  between  the  random  difference  curve  and  the  prediction 
error  curve  shows  how  much  better  PHI-PSI  does  than  distribution-based  guessing. 
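
The two quantities being compared can be sketched as follows. This is an illustrative sketch only: the function names are ours, angle differences are assumed to wrap around at ±180°, and the 1000 residues are sampled with replacement using a fixed seed, none of which is specified in the paper.

```python
import random

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def mean_error(predicted, actual):
    """Average residue error over paired predicted and real angle values."""
    return sum(angle_diff(p, a) for p, a in zip(predicted, actual)) / len(actual)

def random_difference(v, db_angles, sample_size=1000, rng=None):
    """Average difference between angle value v and randomly drawn database angles.

    This is the distribution-based baseline: it depends only on how the
    angle values are distributed in the database, not on the sequence.
    """
    rng = rng or random.Random(0)
    sample = [rng.choice(db_angles) for _ in range(sample_size)]
    return sum(angle_diff(v, a) for a in sample) / sample_size
```

Evaluating `random_difference` for every V in [−180, 180) gives the thin curve; averaging `angle_diff` between prediction and truth over all residues whose real angle is V gives the dark curve.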

⁸ This estimate is based on most multiplications being carried out on 20-bit integers.

⁹ The “random differences” are not totally random; they are determined by the distribution of values
in the database.

¹⁰ Notice that with resolution = 2.5 Å, the φ and ψ angles can only be measured with an accuracy of around
20° to 30°. Quite a few protein structures in our database have that resolution.
From figure 3 we can see that PHI-PSI did well for φ around −60° and for
ψ around −40° and 110°. PHI-PSI did not do very well for φ in region [130, 170] and for ψ
in regions [−170, −140] and [70, 90]. In figure 2 (right) we plotted the distribution of
all the φ and ψ angle values in our database (which is called a Ramachandran plot). We can
observe the following phenomenon: PHI-PSI did much better in dense regions than in
sparse regions. Recall that our prediction algorithm works by finding similar segments
of amino acids for each one in a test protein. When the angles of an amino acid in
the test protein are in a dense region, there are a lot of precedents in the database, thus
PHI-PSI is likely to find ones similar to it; when they are in a sparse region, there are simply
not enough precedents for a good prediction. The φ and ψ angles of α helices and β sheets¹¹
also fall into the dense regions, so another explanation for the observed phenomenon is that
there is a closer correlation between amino acid segments and their structure for helices
and sheets than for other parts of a protein.

5  Related  Work 

The idea of memory-based reasoning is related to the theories on analogy [17], dynamic
memory [13], case-based reasoning [7] [3], and instance-based reasoning and learning [6].
One common theme among all these theories is that one can solve a problem by recalling
one or more related precedents and deriving a solution based on them.

Most heuristic methods for protein structure prediction have focused on the secondary
structures and have adopted a “local approach,” i.e., to predict a residue’s structure based
on its neighboring residues along the primary sequence. Chou & Fasman [1] used the
frequencies with which each amino acid appears in α helices and β sheets to predict the helices
and sheets in a new protein. Garnier et al. [2] computed the correlation of residue j + m
and the “state” of residue j (one of {helix, sheet, coil}), for m ∈ [1, 8], and used this
information in prediction. Rooman & Wodak [12] developed a set of “sequence motifs”
(amino acid patterns) for predicting the secondary structures, and their conclusion is that
the identification of predictive sequence motifs is limited by the size of the currently
available data. There have been several attempts to use the known data directly in prediction.
Levin et al. [10] used a window of length 7 to predict the secondary structures of every
amino acid inside the window. Sweet [15] used a window of length 12 and made use of the
probability distribution of φ-ψ angles to predict the secondary structures. These direct
methods, including our method, have the advantage that their prediction accuracies are
likely to increase as the size of the available data increases. Qian & Sejnowski [11] applied
the back-propagation algorithm to the prediction of α helix and β sheet. They got ~64%
accuracy for 15 test sequences, which they claimed is the best result so far.

A few AI researchers have attacked the 3-D structure prediction problem. Hayes-Roth
et al. [4] used multiple sources of information (mainly through nuclear magnetic
resonance (NMR); they also assumed that a protein’s secondary structures are known)
and the blackboard control architecture to identify legal positions for each of a protein’s
constituent structures (atoms, amino acids, helices, etc.), which usually results in a large
number of possible structures. Currently the NMR techniques are limited to proteins with
< 100 residues. The ARIADNE system [9] is another interesting effort. It is essentially
a recognition system that uses hierarchical representation of protein structures to decide
whether a primary sequence can fold into a given 3-D structure.

¹¹ See footnote 2 for definitions.


6  Summary  and  Future  Research 

This  paper  reports  our  initial  work  on  the  prediction  of  protein  structures  by  Memory-based 
Reasoning. There has not been much work done in predicting the φ and ψ angles directly from
the  known  data.  We  feel  that  this  is  an  important  direction  to  explore,  and  that  Memory- 
based  Reasoning  is  the  right  technique  for  this  task.  We  will  continue  our  research  in 
several  directions,  among  them  are: 

(1) Currently in PHI-PSI when we compute the weighted sum of matrix entries and the
weighted sum of subscores, all the segments in the database share the same two sets of
weights: one set associated with the similarity matrices, the other associated with the
window. This may not be the optimal way to do it, because the “important properties”
and the “important positions” may be different in each case, and even the window size for
each  case  does  not  necessarily  have  to  be  the  same.  We  are  developing  learning  algorithms 
to  automatically  identify  these  optimal  parameters  for  each  case. 

(2) We have used a network structure similar to that in [11] to predict the φ and ψ angles of a few
sequences, and found the overall accuracy was very close to that of PHI-PSI. However,
their predictions for each particular sequence are not the same, which suggests that the
combination of multiple methods may give better results than any single one. We plan to
implement a system which uses multiple methods and to derive a way to combine their
results by carefully examining under what conditions each method is likely to give a correct
prediction.

(3) Amino acids in a protein interact with others that are close to them in space, some of
which may be far away along the primary sequence. It has been known for a long time
that this kind of “global interaction” is crucial for protein folding and structural stability.
One way to take into account this global interaction is to use the information about the
super-secondary structures.¹² For future research, in PHI-PSI, each known amino acid
sequence will be associated with not only the φ and ψ angles, but also the secondary and
super-secondary structure information. PHI-PSI will include this higher-order structure
information in finding best matches recursively.

¹² A super-secondary structure is a group of secondary structures that are arranged in some more or less
regular way.
Figure 1: Left - The chemical structure of an amino acid (labels: amino group, carboxyl
group, side chain). Right - An amino acid sequence.


Figure 2: Left - The φ and ψ angles. Right - The distribution of φ and ψ angle values
in our database.


Figure 3: Left - Residue error curve for φ. Right - Residue error curve for ψ.